
CPython in Python 3.13

The Fastest Python in History - What Actually Changed

Between Python 3.10 and 3.13, CPython became dramatically faster without any changes to the language itself. The numbers:

| Version | Benchmark speedup vs 3.10 | Primary change |
|---|---|---|
| 3.10 | baseline | - |
| 3.11 | ~25% faster | Specialising adaptive interpreter, new frame layout |
| 3.12 | ~5-10% faster | Immortal objects, specialisation improvements |
| 3.13 | ~5% faster (GIL) | Further specialisation, JIT foundation work |
| 3.13t | potentially >2x for multithreaded CPU-bound | Free-threaded (no GIL) |

These speedups are remarkable because they are free - you run python3.11 instead of python3.10 and your code is faster with no changes.

Understanding what changed tells you how Python's performance will continue to evolve and what assumptions you can rely on in your code.

import sys
import timeit

print(f"Python {sys.version}")

# A benchmark that benefits from 3.11+ improvements
def benchmark():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

t = timeit.timeit(benchmark, number=10)
print(f"Sum 1M integers × 10: {t:.3f}s")
# Python 3.10: ~1.2s
# Python 3.11: ~0.9s (25% faster)
# Python 3.12: ~0.85s (5% more)
# Python 3.13: ~0.82s (3% more)

Python 3.11: The Specialising Adaptive Interpreter

The headline feature of Python 3.11 is the specialising adaptive interpreter (PEP 659), which rewrites bytecode in place through a process called "quickening". It is the most significant CPython performance improvement since Python 3.0.

How Specialisation Works

Every specialisable instruction in a running program has a small counter stored next to it in the bytecode. Initially, these instructions are "adaptive" - they execute via the slow generic path but observe the types of their operands:

At function entry (first call):
  All BINARY_OP (+) opcodes are in ADAPTIVE state

BINARY_OP (+) executes with operands (int, int):
  1. Increment specialisation counter
  2. Execute via slow generic path (type dispatch through tp_as_number)

After ~8 calls:
  Counter threshold reached
  Examine: both operands are int
  Replace BINARY_OP (+) with BINARY_OP_ADD_INT (specialised opcode)

Subsequent calls:
  BINARY_OP_ADD_INT executes with operands (int, int):
  1. Fast exact-type check: type(left) is int and type(right) is int?
     (subclasses such as bool do not match)
  2. YES: call long_add() directly - skip all type dispatch
  3. NO: deoptimise back to BINARY_OP (+), reset counter
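The transition described above can be observed from Python with the dis module. This is a minimal sketch, assuming CPython 3.11 or newer for the adaptive=True flag (older versions fall back to plain disassembly); the add and opnames helpers are illustrative names, and the exact specialised opcode that appears depends on the interpreter version.

```python
import dis
import sys

def add(a, b):
    return a + b

def opnames(func):
    """Return opcode names for func, showing specialised forms where supported."""
    if sys.version_info >= (3, 11):
        # adaptive=True (added in Python 3.11) shows the quickened bytecode
        return [ins.opname for ins in dis.get_instructions(func, adaptive=True)]
    return [ins.opname for ins in dis.get_instructions(func)]

before = opnames(add)

# Warm up the call site with int operands only
for _ in range(10_000):
    add(1, 2)

after = opnames(add)

print("before:", [name for name in before if name.startswith("BINARY")])
print("after: ", [name for name in after if name.startswith("BINARY")])
```

On a specialising interpreter the second line is expected to show an int-specific variant of the addition opcode; calling add("a", "b") afterwards and disassembling again is a simple way to watch the site deoptimise.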

The specialisation table for BINARY_OP:

// From Python/specialize.c (simplified)
void
_Py_Specialize_BinaryOp(PyObject *lhs, PyObject *rhs, _Py_CODEUNIT *instr,
                        int oparg, PyObject **locals)
{
    PyTypeObject *ltype = Py_TYPE(lhs);
    PyTypeObject *rtype = Py_TYPE(rhs);

    if (ltype == &PyLong_Type && rtype == &PyLong_Type) {
        switch (oparg) {
            case NB_ADD:      instr->op.code = BINARY_OP_ADD_INT; return;
            case NB_SUBTRACT: instr->op.code = BINARY_OP_SUBTRACT_INT; return;
            case NB_MULTIPLY: instr->op.code = BINARY_OP_MULTIPLY_INT; return;
        }
    }
    if (ltype == &PyFloat_Type && rtype == &PyFloat_Type) {
        switch (oparg) {
            case NB_ADD: instr->op.code = BINARY_OP_ADD_FLOAT; return;
            // ...
        }
    }
    if (ltype == &PyUnicode_Type && rtype == &PyUnicode_Type) {
        if (oparg == NB_ADD) { instr->op.code = BINARY_OP_ADD_UNICODE; return; }
    }
    // No specialisation available: mark as BINARY_OP_GENERIC (fast generic path)
}

Complete Specialisation Inventory (3.11-3.13)

import opcode

# Specialised opcodes are not listed in opcode.opmap. CPython 3.11+ exposes them
# through the private opcode._specializations mapping (base opcode -> variants).
# This is an implementation detail and may change between versions.
families = getattr(opcode, "_specializations", {})
for base in sorted(families):
    print(f"{base}:")
    for name in families[base]:
        print(f"    {name}")

Key specialised opcodes and their speedups:

| Specialised Opcode | Generic form | Speedup | Condition |
|---|---|---|---|
| BINARY_OP_ADD_INT | BINARY_OP + | ~2x | Both operands int |
| BINARY_OP_ADD_FLOAT | BINARY_OP + | ~1.5x | Both operands float |
| LOAD_GLOBAL_MODULE | LOAD_GLOBAL | ~1.3x | Name in module globals (cached index) |
| LOAD_GLOBAL_BUILTIN | LOAD_GLOBAL | ~1.5x | Name in builtins (direct pointer) |
| CALL_PY_EXACT_ARGS | CALL | ~1.5x | Python function, exact positional arg count |
| CALL_BUILTIN_FAST | CALL | ~2x | C builtin like len, isinstance |
| LOAD_ATTR_SLOT | LOAD_ATTR | ~1.5x | Attribute is a __slots__ member |
| LOAD_ATTR_WITH_HINT | LOAD_ATTR | ~1.3x | Instance dict attr with cached dict version |
Python 3.11: The New Frame Layout

The other major 3.11 change was redesigning the Python frame object.

Before 3.11: PyFrameObject

// Python 3.10 and earlier (simplified)
typedef struct _frame {
    PyObject_VAR_HEAD            // Python object header (can be accessed as Python obj)
    struct _frame *f_back;       // Previous frame (linked list of Python objects)
    PyCodeObject *f_code;        // Code object
    PyObject *f_builtins;
    PyObject *f_globals;
    PyObject *f_locals;          // Locals mapping (for functions, filled in on demand)
    PyObject **f_valuestack;     // Bottom of value stack (points into f_localsplus)
    PyObject **f_stacktop;       // Top of value stack
    // ... many more fields for debugging, tracing, etc.
    PyObject *f_localsplus[1];   // Locals + free vars + cell vars + value stack
} PyFrameObject;

Problems: every call produced a full PyFrameObject, a garbage-collected Python heap object that could be inspected or kept alive by tracebacks; frames were chained together through Python object pointers; and the debugging and tracing fields had to be initialised on every call whether or not anything ever looked at them.

After 3.11: _PyInterpreterFrame

// Python 3.11+ (simplified; exact fields differ between 3.11, 3.12 and 3.13)
// This is an INTERNAL frame - thin, stored in the thread's preallocated data stack
typedef struct _PyInterpreterFrame {
    PyCodeObject *f_code;              // 8 bytes
    _PyInterpreterFrame *previous;     // 8 bytes (not a Python linked list)
    PyObject *f_funcobj;               // 8 bytes (the function being called)
    PyObject *f_globals;               // 8 bytes
    PyObject *f_builtins;              // 8 bytes
    PyObject *f_locals;                // 8 bytes (NULL until materialised)
    PyFrameObject *frame_obj;          // 8 bytes (NULL until Python code accesses it)
    _Py_CODEUNIT *prev_instr;          // 8 bytes (instruction pointer)
    int f_lasti;                       // 4 bytes
    uint16_t f_frame_state;            // 2 bytes (created/suspended/executing/completed)
    char owner;                        // 1 byte
    // localsplus[] follows immediately in memory
    PyObject *localsplus[1];           // locals + cells + freevars + value stack
} _PyInterpreterFrame;

Key improvements:

  • Frames on a per-thread data stack - small frames (< 512 bytes) are allocated in the thread state's "frame stack" (a preallocated contiguous buffer), not the heap. No malloc() call per function invocation.
  • Lazy f_locals - the locals dict is only created when Python code actually asks for it, for example through locals() or inspect.currentframe().f_locals. For normal function calls, it is never allocated.
  • Lazy PyFrameObject - the visible Python frame object is only created if code introspects the stack. Tracebacks trigger frame materialisation; normal execution does not.

Result: function call overhead reduced by ~30%. Together with specialisation, this accounts for most of the Python 3.11 speedup.

import sys
import timeit

# You can still access frames - they are materialised on demand
def show_overhead():
    frame = sys._getframe()  # Triggers PyFrameObject materialisation
    print(f"Frame: {frame.f_code.co_name}")

# But normal function calls pay almost nothing for frame setup
def no_frame_access(x):
    return x + 1

def with_frame_access(x):
    f = sys._getframe()  # Materialises the frame
    return x + 1

t1 = timeit.timeit(lambda: no_frame_access(1), number=5_000_000)
t2 = timeit.timeit(lambda: with_frame_access(1), number=5_000_000)
print(f"No frame access: {t1:.3f}s")
print(f"With frame access: {t2:.3f}s")
print(f"Overhead ratio: {t2/t1:.1f}x")
# Frame materialisation adds significant overhead - avoid in hot paths

Python 3.12: Immortal Objects

Python 3.12 introduced immortal objects (PEP 683). Certain objects - True, False, None, small integers, and a few other singletons - are declared immortal: their reference count is never modified.

The Problem Immortal Objects Solve

In CPython 3.11 and earlier, even accessing None required a Py_INCREF (increment refcount) and eventually a Py_DECREF (decrement refcount). Since None is referenced millions of times per second in a running program, these refcount updates:

  1. Dirty memory pages - writing to None's ob_refcnt marks its memory page as modified
  2. Prevent copy-on-write sharing - forked processes cannot share pages that are being written to
  3. Add cache pressure - every write to ob_refcnt invalidates CPU cache lines

For multiprocessing and pre-fork server models (Gunicorn, uWSGI), None, True, False, and small integers are accessed constantly by worker processes. Without immortal objects, every worker immediately dirtied the pages containing these objects, causing the OS to copy those pages (copy-on-write) and eroding the memory savings that pre-forking is meant to provide.

How Immortal Objects Work

// Include/object.h (Python 3.12, simplified)

// An immortal object has ob_refcnt set to a special sentinel value
#if SIZEOF_VOID_P > 4
#define _Py_IMMORTAL_REFCNT (Py_ssize_t)UINT_MAX          // 64-bit: 4294967295
#else
#define _Py_IMMORTAL_REFCNT (Py_ssize_t)(UINT_MAX >> 2)   // 32-bit: 1073741823
#endif

// The check (used in Py_INCREF and Py_DECREF before touching the count).
// Simplified: 64-bit builds actually test the sign bit of the low 32 bits.
static inline int
_Py_IsImmortal(PyObject *op)
{
    return op->ob_refcnt == _Py_IMMORTAL_REFCNT;
}

// Modified Py_DECREF: check immortality before decrementing
#define Py_DECREF(op) do { \
    PyObject *_py_decref_tmp = (PyObject *)(op); \
    if (_Py_IsImmortal(_py_decref_tmp)) { /* do nothing */ break; } \
    if (--(_py_decref_tmp)->ob_refcnt == 0) { \
        _Py_Dealloc(_py_decref_tmp); \
    } \
} while (0)

Immortal objects in Python 3.12+:

import sys

# These are immortal - their refcount never changes
print(id(None))
print(id(True))
print(id(False))

# Small integers (-5 to 256) are also immortal in 3.12+
# Their ob_refcnt is set to _Py_IMMORTAL_REFCNT at startup

# Verify: getrefcount returns a large fixed value for immortal objects
print(sys.getrefcount(None))   # 4294967295 on 64-bit CPython 3.12 and 3.13
print(sys.getrefcount(True))   # Same
print(sys.getrefcount(False))  # Same

# You can check your Python version
version = sys.version_info
print(f"Python {version.major}.{version.minor}")
# If 3.12+, None/True/False are immortal
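A more direct way to see immortality is to check whether creating new references changes the reported count at all. The sketch below uses a hypothetical helper, refcount_delta, that takes n extra references to an object and reports how much sys.getrefcount() moved. It assumes CPython; other implementations may not expose meaningful reference counts.

```python
import sys

def refcount_delta(obj, n=1000):
    """Return how much sys.getrefcount(obj) changes after taking n new references."""
    before = sys.getrefcount(obj)
    refs = [obj] * n          # the list holds n additional references to obj
    after = sys.getrefcount(obj)
    del refs
    return after - before

# An ordinary object gains exactly n references
print("object():", refcount_delta(object()))

# On CPython 3.12+ the count of an immortal object does not move;
# on 3.11 and earlier it grows by n like any other object
print("None:    ", refcount_delta(None))
```

The same helper can be pointed at small integers or True/False to confirm which objects a given interpreter treats as immortal.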

The benefit is measurable in multiprocessing workloads:

import multiprocessing
import time
import sys

def worker_task(n):
    """Task that uses None, True, False extensively."""
    results = []
    for i in range(n):
        x = None
        flag = True if i % 2 == 0 else False
        if x is None and flag:
            results.append(i)
    return len(results)

if __name__ == '__main__':
    start = time.perf_counter()
    with multiprocessing.Pool(4) as pool:
        results = pool.map(worker_task, [500_000] * 4)
    elapsed = time.perf_counter() - start
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {elapsed:.3f}s")
    # Python 3.11: ~X.Xs (workers dirty pages for None/True/False)
    # Python 3.12: slightly faster (immortal objects: no page writes)

Python 3.12: Per-Interpreter GIL

Python 3.12 implemented PEP 684, giving each sub-interpreter its own independent GIL:

# Python 3.12+ only
import sys
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# The low-level module is private: _xxsubinterpreters in 3.12, _interpreters in 3.13+
try:
    try:
        import _interpreters
    except ImportError:
        import _xxsubinterpreters as _interpreters
    print("Per-interpreter GIL available")

    # Each new interpreter has:
    # - Its own GIL (can run concurrently with other interpreters)
    # - Its own sys.modules (module namespace isolation)
    # - Its own memory arena (no shared heap)
    # Limitation: Python objects CANNOT be shared directly between interpreters
    # (They live in different heaps with different reference count trackers)

except ImportError:
    print("low-level interpreters module not available in this build")

# The interpreter config (controls per-interp GIL)
# PyInterpreterConfig.gil field:
#   PyInterpreterConfig_DEFAULT_GIL = 0 (share main GIL)
#   PyInterpreterConfig_SHARED_GIL = 1 (explicitly share)
#   PyInterpreterConfig_OWN_GIL = 2 (own GIL - enables true concurrency)

Per-interpreter GIL enables a new concurrency model:

Process:
Main interpreter (with its own GIL)

├── Thread 1: interpreter A (GIL-A) ──running Python in parallel──┐
├── Thread 2: interpreter B (GIL-B) ──running Python in parallel──┤
└── Thread 3: interpreter C (GIL-C) ──running Python in parallel──┘

All three interpreters run Python bytecode simultaneously.
No GIL contention between them (each holds its own lock).
But: objects cannot be directly passed between interpreters.
Data sharing requires copying: pickle/unpickle, shared memory buffers, or the
cross-interpreter queues defined in PEP 734.

Python 3.13: Free-Threaded Mode

Python 3.13 ships the most consequential CPython change in decades: an optional free-threaded build that eliminates the GIL entirely.

Installing and Using the Free-Threaded Build

# Install the free-threaded build
# Ubuntu (python3.13-nogil is packaged in the deadsnakes PPA, not the default repos)
sudo apt install python3.13-nogil

# macOS via pyenv
pyenv install 3.13t
pyenv global 3.13t # 't' denotes free-threaded

# Build from source
./configure --disable-gil
make -j$(nproc)

# Verify
python3 -c "import sys; print(sys._is_gil_enabled())"
# True → GIL is enabled (standard build)
# False → GIL is disabled (free-threaded build)

# Runtime control (free-threaded build only)
# PYTHON_GIL=1 python3.13t script.py → enable GIL (compatibility mode)
# PYTHON_GIL=0 python3.13t script.py → disable GIL (default for 3.13t)

Architecture Change 1: Biased Reference Counting

The core problem with removing the GIL: ob_refcnt is a shared mutable integer. Without the GIL, concurrent writes cause data races.

The naive fix - make every Py_INCREF/Py_DECREF an atomic operation - is too expensive because atomics require CPU memory barriers (expensive on x86, very expensive on ARM).

The free-threaded build uses biased reference counting (based on research by Choi et al., 2018):

// Simplified model. The real free-threaded header (PEP 703) keeps the owner's
// thread id in ob_tid and the two counts in ob_ref_local and ob_ref_shared.
// Each object has two reference count components (free-threaded build):
struct _object {
    Py_ssize_t ob_refcnt;     // Shared refcount (atomic)
    Py_ssize_t ob_ref_local;  // Local refcount (non-atomic, owning thread only)
    // The "owning thread" is the thread that created the object
};

// Increment from owning thread (fast path - no atomic):
void _Py_INCREF_owned(PyObject *op) {
    op->ob_ref_local++;  // Non-atomic - only called from owning thread
}

// Increment from non-owning thread (slower path - atomic):
void _Py_INCREF_shared(PyObject *op) {
    _Py_atomic_add(&op->ob_refcnt, 1);  // Atomic
}

// Decrement from owning thread: merge local and shared counts when local hits 0
void _Py_DECREF(PyObject *op) {
    if (--op->ob_ref_local == 0) {
        // No more local refs - check shared refs
        Py_ssize_t shared = _Py_atomic_load(&op->ob_refcnt);
        if (shared == 0) {
            // No refs anywhere - free the object
            _Py_Dealloc(op);
        }
    }
}

The result: objects used only by their creating thread (the common case) pay zero atomic overhead. Only objects explicitly shared across threads incur atomic operations.
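The two-counter idea can be modelled in a few lines of Python. The following is a toy sketch only, not CPython's layout or algorithm: the BiasedRefCount class and its fields are hypothetical names, a threading.Lock stands in for atomic instructions, and the sketch omits the queued-merge path that the real scheme needs when a non-owning thread drops a reference it did not take itself.

```python
import threading

class BiasedRefCount:
    """Toy model of biased reference counting (illustrative, not CPython's code)."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the object
        self.local = 1                      # owner-only count, updated without a lock
        self.shared = 0                     # count for all other threads
        self._lock = threading.Lock()       # stands in for atomic instructions
        self.merged = False                 # True once the owner's count reached zero
        self.freed = False

    def _on_fast_path(self):
        return not self.merged and threading.get_ident() == self.owner

    def incref(self):
        if self._on_fast_path():
            self.local += 1                 # fast path: plain increment
        else:
            with self._lock:                # slow path: synchronised update
                self.shared += 1

    def decref(self):
        if self._on_fast_path():
            self.local -= 1
            if self.local == 0:
                with self._lock:
                    self.merged = True      # owner is done: only shared refs remain
                    if self.shared == 0:
                        self.freed = True
        else:
            with self._lock:
                self.shared -= 1
                if self.merged and self.shared == 0:
                    self.freed = True

obj = BiasedRefCount()                      # the creating thread holds one reference

def other_thread():
    obj.incref()                            # slow path: shared becomes 1
    obj.decref()                            # slow path: shared back to 0

worker = threading.Thread(target=other_thread)
worker.start()
worker.join()

obj.incref()                                # fast path: local becomes 2
obj.decref()                                # fast path: local back to 1
print(obj.local, obj.shared, obj.freed)     # 1 0 False
obj.decref()                                # local hits 0, shared is 0: freed
print(obj.freed)                            # True
```

The point of the model is the asymmetry: the owner's increments and decrements never take the lock until its own count reaches zero, while every operation from another thread pays for synchronisation.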

Architecture Change 2: Per-Object Locks

The GIL also protected operations that are not naturally atomic: dict resize, list append during resize, type method lookup. Without the GIL, these need explicit synchronisation.

The free-threaded build adds small embedded locks to mutable container types:

PyDictObject (free-threaded build):
┌─────────────────────────────┐
│ ob_refcnt (atomic) │ 8 bytes
│ ob_ref_local (non-atomic) │ 8 bytes (new field)
│ ob_type │ 8 bytes
│ _ob_mutex (embedded lock) │ 8 bytes (new field)
│ ma_used │ 8 bytes
│ ma_version_tag (atomic) │ 8 bytes
│ ma_keys │ 8 bytes
│ ma_values │ 8 bytes
└─────────────────────────────┘
Total: ~64 bytes vs ~40 bytes in the GIL build

# In the free-threaded build, dict operations are thread-safe
import threading
import sys

if hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled():
    print("Running in free-threaded mode")

# Individual stores are safe - the dict has a per-object lock:
shared_dict = {}

def writer(thread_id):
    for i in range(10_000):
        shared_dict[f"key_{thread_id}_{i}"] = i

threads = [threading.Thread(target=writer, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Dict has {len(shared_dict)} entries")
# In GIL build: exactly 40,000 (the GIL serialises each store)
# In free-threaded build: exactly 40,000 (the per-dict lock serialises each store)
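Per-object locks make each individual dict operation safe, but they do not make a compound read-modify-write such as d[k] += 1 atomic, in either build. A minimal sketch of the usual remedy, an explicit threading.Lock around the compound update (the counts dict and bump function are illustrative names):

```python
import threading

counts = {"hits": 0}
counts_lock = threading.Lock()

def bump(n):
    for _ in range(n):
        # Read, add and store must happen as one unit, so guard all three
        with counts_lock:
            counts["hits"] += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counts["hits"])  # 40000
```

Without the lock, two threads can read the same old value and one increment is lost; the GIL makes that window small but does not close it.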

Current Performance Characteristics

As of Python 3.13, the free-threaded build has these characteristics:

import sys
import time
import timeit
import threading

def cpu_task(n):
    total = 0
    for i in range(n):
        total += i
    return total

N = 1_000_000

# Single-threaded performance comparison
# (illustrative - run this on both standard and free-threaded builds)
t = timeit.timeit(lambda: cpu_task(N), number=5)
gil_state = ("GIL disabled" if (hasattr(sys, '_is_gil_enabled') and
                                not sys._is_gil_enabled()) else "GIL enabled")
print(f"[{gil_state}] Single-thread {N} iterations × 5: {t:.3f}s")

# Illustrative results (absolute times vary by machine; compare the ratios):
# GIL enabled (standard 3.13): ~2.1s (baseline)
# GIL disabled (free-threaded 3.13): ~2.7s (~28% slower single-threaded)
# Reason: biased refcounting + per-object lock overhead even in single-threaded code

# Multi-threaded CPU-bound with free-threaded:
start = time.perf_counter()
threads = [threading.Thread(target=cpu_task, args=(N,)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
elapsed = time.perf_counter() - start
print(f"[{gil_state}] 4 threads × {N}: {elapsed:.3f}s")

# Illustrative results (absolute times vary by machine; compare the ratios):
# GIL enabled: ~8.5s (serialised by GIL)
# GIL disabled: ~2.8s (~3x speedup with 4 threads - approaching linear scaling)

The performance roadmap: the CPython team expects the single-threaded overhead to decrease to under 5% within 2-3 releases, as the internal data structures and hot paths are optimised for the no-GIL model.
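To judge thread scaling on a single interpreter without comparing two installations, the same work can be timed serially and through a thread pool in one run. This is a rough sketch rather than a rigorous benchmark: measure is a hypothetical helper, and the ratio it reports is noisy for small workloads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    total = 0
    for i in range(n):
        total += i
    return total

def measure(n, workers):
    """Time `workers` copies of cpu_task run serially, then in a thread pool."""
    start = time.perf_counter()
    serial = [cpu_task(n) for _ in range(workers)]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        threaded = list(pool.map(cpu_task, [n] * workers))
    threaded_s = time.perf_counter() - start

    return {"serial_s": serial_s, "threaded_s": threaded_s,
            "same_results": serial == threaded}

result = measure(200_000, workers=4)
print(f"serial:   {result['serial_s']:.3f}s")
print(f"threaded: {result['threaded_s']:.3f}s")
print(f"speedup:  {result['serial_s'] / result['threaded_s']:.2f}x")
```

A speedup near 1x indicates the GIL (or contention) is serialising the threads; a value approaching the worker count indicates the work is running in parallel.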

Beyond 3.13: Sub-Interpreters and the interpreters Module

PEP 734 defines a high-level API for sub-interpreter parallelism. It did not make it into the Python 3.13 standard library, where only the private _interpreters module exists; the public API shipped in Python 3.14 as concurrent.interpreters:

# Python 3.14+ only (concurrent.interpreters, PEP 734)
import sys
import threading

if sys.version_info >= (3, 14):
    from concurrent import interpreters

    # Create a queue for inter-interpreter communication
    q = interpreters.create_queue()

    # Create a sub-interpreter
    interp = interpreters.create()

    # Run code in the sub-interpreter (in a separate thread)
    def run_in_interp():
        interp.exec("""
import math
result = sum(math.sqrt(i) for i in range(100_000))
""")

    t = threading.Thread(target=run_in_interp)
    t.start()
    t.join()

    interp.close()
    print("Sub-interpreter completed")
else:
    print("concurrent.interpreters requires Python 3.14+")

The key constraint: Python objects cannot be directly shared between interpreters. Values sent through a queue are copied: a small set of basic immutable types (None, bool, int, float, str, bytes, and tuples of these) is passed efficiently, and other objects fall back to pickle, so the receiver always works on its own copy:

# Data sharing between interpreters uses queues
# Objects are copied (or pickled) on send and rebuilt on receive

# This is fundamentally different from threads (shared memory)
# More like processes but with lower overhead (no process creation)

# Comparison:
# Threads: shared memory, low overhead, GIL limits CPU parallelism
# Sub-interpreters: no shared memory (use queues), medium overhead, no GIL between interps
# Processes: no shared memory (use IPC), high overhead (process creation), no GIL

What This Means for Library Authors

If you write C extensions or libraries targeting production Python, the 3.11-3.13 changes require attention:

Thread Safety Without the GIL

C extensions that assumed single-threaded bytecode execution are not safe in free-threaded mode:

// This is thread-unsafe without the GIL:
static int global_counter = 0;

static PyObject *
increment_counter(PyObject *self, PyObject *args)
{
    global_counter++;  // Data race: two threads may read same value
    return PyLong_FromLong(global_counter);
}

// Thread-safe version using atomic operations:
#include <stdatomic.h>
static _Atomic int global_counter = 0;

static PyObject *
increment_counter_safe(PyObject *self, PyObject *args)
{
    int result = atomic_fetch_add(&global_counter, 1) + 1;
    return PyLong_FromLong(result);
}

The Py_GIL_DISABLED preprocessor flag lets you write conditional code:

#include "Python.h"

#ifdef Py_GIL_DISABLED
// Module-level lock (PyMutex is part of the 3.13+ C API; zero-initialised = unlocked)
static PyMutex my_module_lock = {0};
#endif

static PyObject *
my_function(PyObject *self, PyObject *args)
{
#ifdef Py_GIL_DISABLED
    // Free-threaded build: use atomic operations or per-object locking
    PyMutex_Lock(&my_module_lock);
    // ... critical section ...
    PyMutex_Unlock(&my_module_lock);
#else
    // Standard build: GIL is held, no additional synchronisation needed
    // ... critical section ...
#endif
    Py_RETURN_NONE;
}

Declaring Thread Safety to Python

C extensions can declare their thread-safety status in their module definition:

static struct PyModuleDef_Slot module_slots[] = {
    // Declare the module is safe to use without the GIL
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
    {0, NULL}
};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "mymodule",      // m_name
    NULL,            // m_doc
    0,               // m_size
    module_methods,  // m_methods
    module_slots,    // m_slots
    NULL,            // m_traverse
    NULL,            // m_clear
    NULL             // m_free
};

Without Py_MOD_GIL_NOT_USED, importing a C extension in free-threaded mode re-enables the GIL for the entire process (to maintain compatibility).
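Whether a particular import flipped the GIL back on can be checked from Python by sampling sys._is_gil_enabled() before and after the import. The helpers below (gil_enabled and import_reenables_gil) are hypothetical names for this sketch; on interpreters without sys._is_gil_enabled() (3.12 and earlier) the GIL is assumed to be always on.

```python
import importlib
import sys

def gil_enabled():
    """Report whether the GIL is currently active in this process."""
    probe = getattr(sys, "_is_gil_enabled", None)
    # Interpreters without the probe (3.12 and earlier) always run with the GIL
    return True if probe is None else bool(probe())

def import_reenables_gil(module_name):
    """Import module_name and return True if doing so turned the GIL back on."""
    before = gil_enabled()
    importlib.import_module(module_name)
    after = gil_enabled()
    return (not before) and after

for name in ("json", "sqlite3"):
    try:
        print(f"{name}: re-enabled GIL = {import_reenables_gil(name)}")
    except ImportError:
        print(f"{name}: not installed")
```

Running this over a project's third-party extension modules on a free-threaded interpreter gives a quick list of the imports that need attention. Note that a module already imported earlier in the process will not trigger the switch again.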

Migration and Compatibility Guide

Upgrading to Python 3.11

  • Check C extensions: debugging helpers such as sys.getrefcount() still work. Extensions that read PyFrameObject fields directly must be audited - the struct is no longer part of the public headers in 3.11, and fields are reached through accessor functions such as PyFrame_GetCode().
  • Benefit: ~25% performance improvement, free.
  • Gotcha: locals() inside a function returns a snapshot dict; modifying it does not affect the actual local variables. This was always the documented behaviour, but some code relied on CPython implementation details that made writes appear to work.

Upgrading to Python 3.12

  • Benefit: immortal objects reduce memory pressure in forked processes. If you use Gunicorn/uWSGI with pre-forking, test the memory improvement.
  • Note: sys.getrefcount(None) now returns a large fixed value (4294967295 on 64-bit builds of 3.12) instead of a live count. Code that checks the reference count of singletons (rare, but it exists) will break.
  • Per-interpreter GIL: only relevant if you create sub-interpreters through the C-level Py_NewInterpreterFromConfig() API.

Upgrading to Python 3.13 Standard Build

  • No breaking changes for most application code.
  • REPL improvement: the new REPL supports multi-line editing with history, coloured prompts and tracebacks, and a paste mode.
  • locals() semantics fix (PEP 667): in optimised scopes such as functions, each locals() call now returns an independent snapshot of the current values, and frame.f_locals is a write-through proxy. This resolves a long-standing inconsistency.
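The locals() behaviour is easy to check directly. A small sketch (demo and same_dict_each_call are illustrative names): the first function shows that writing to the returned dict never changes the real variable, which holds on every CPython version; the second distinguishes the pre-3.13 cached dict from the 3.13+ independent snapshots.

```python
def demo():
    x = 1
    snapshot = locals()      # a dict holding the locals at this moment
    snapshot["x"] = 99       # changes the dict only, never the variable x
    return x, snapshot["x"]

def same_dict_each_call():
    # True on CPython 3.12 and earlier (one cached dict per frame),
    # False on 3.13+ (PEP 667: each call builds an independent snapshot)
    return locals() is locals()

print(demo())                # (1, 99)
print(same_dict_each_call())
```

Code that stashes the result of locals() and expects later calls to refresh it in place is the pattern most likely to change behaviour on 3.13.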

Evaluating Python 3.13 Free-Threaded Build

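import sys
import sysconfig

# Checking which build you are on
def check_build():
    # Py_GIL_DISABLED is set only for free-threaded builds (3.13+).
    # sys._is_gil_enabled() alone cannot tell: it also exists on standard 3.13 builds.
    if not sysconfig.get_config_var("Py_GIL_DISABLED"):
        return "Standard build (GIL always enabled)"
    if sys._is_gil_enabled():
        return "Free-threaded build with GIL enabled (PYTHON_GIL=1 or re-enabled by an extension)"
    return "Free-threaded build with GIL disabled"

print(check_build())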

# Before adopting free-threaded:
# 1. Profile your workload: is it actually CPU-bound? Threads are only beneficial
# for CPU-bound pure Python work. I/O-bound: asyncio is still better.
# 2. Audit C extensions: check their Py_mod_gil declarations.
# Unmaintained extensions without Py_MOD_GIL_NOT_USED will re-enable the GIL.
# 3. Measure: the 20-40% single-threaded overhead may outweigh threading gains
# for lightly parallel workloads.
# 4. Test thread safety: code that worked "by accident" under the GIL may have
# races in free-threaded mode.

# Useful environment variables for free-threaded testing:
# PYTHON_GIL=0 → disable GIL (in free-threaded builds)
# PYTHON_GIL=1 → enable GIL even in free-threaded builds (testing)
# PYTHONFAULTHANDLER=1 → dump stack on crash (useful for race conditions)
# PYTHONMALLOC=debug → memory corruption detection

Interview Q&A

Q1: What specifically changed in Python 3.11 to make it 25% faster? Why couldn't earlier versions achieve this?

Python 3.11 delivered its speedup through two complementary changes.

First, the specialising adaptive interpreter (PEP 659). At function compilation time, CPython emits generic opcodes like BINARY_OP and LOAD_GLOBAL. In 3.11, these opcodes observe what types flow through them at runtime. After ~8 executions, an opcode site that always sees int + int replaces itself with BINARY_OP_ADD_INT - a specialised variant that skips type dispatch and calls long_add() directly. Similarly, LOAD_GLOBAL becomes LOAD_GLOBAL_BUILTIN with a direct pointer to the builtin, skipping dict hash lookups. This reduces the cost of common operations by 1.3-2x.

Second, the frame layout redesign (_PyInterpreterFrame). Previously, every Python function call allocated a PyFrameObject on the heap - a Python object with a full object header, a separate f_locals dict, and a pointer-based value stack. In 3.11, function calls use _PyInterpreterFrame - a thin struct that lives on a per-thread "frame stack" (a preallocated contiguous buffer). No heap allocation, no malloc call per function invocation. The f_locals dict and the visible PyFrameObject are only created when code explicitly introspects the stack. This reduces function call overhead by ~30%.

Earlier versions could not achieve this because the specialising interpreter required a complete redesign of the opcode dispatch mechanism (adding adaptive opcodes, specialisation counters, deoptimisation paths), and the frame redesign required breaking the assumption that frame objects are always Python heap objects accessible via sys._getframe().

Q2: What are immortal objects in Python 3.12? What problem do they solve?

Immortal objects (PEP 683) are Python objects whose reference count is set to a special sentinel value (_Py_IMMORTAL_REFCNT, which is 2^32 - 1 on 64-bit builds of 3.12) and is never modified. True, False, None, small integers (-5 to 256), and some internal singletons are immortal.

The problem they solve: in CPython's reference counting model, every time code reads None, Python must call Py_INCREF(None) to reflect the new reference and eventually Py_DECREF(None). Incrementing ob_refcnt is a write to the memory page containing None. In a multi-process server (Gunicorn pre-fork, uWSGI), child processes initially share memory pages with the parent via copy-on-write. As soon as any child writes to a shared page - even just incrementing None's refcount - the OS must copy that page for the child's exclusive use. Since None is used in virtually every Python function (as a default return value, in if x is None checks, etc.), every worker immediately triggered copy-on-write for the pages containing these singletons. This caused each Gunicorn worker to have its own private copy of hundreds of KB of singleton objects that could have been shared, wasting memory.

With immortal objects, the Py_DECREF and Py_INCREF macros first check _Py_IsImmortal(op). For immortal objects, they do nothing - no write to ob_refcnt, no page dirtying, and copy-on-write pages remain shared across all worker processes.

Q3: What is biased reference counting? How does it enable free-threaded Python?

Biased reference counting is a technique for making reference counting efficient in a multithreaded environment without the GIL. Without the GIL, naively making every Py_INCREF/Py_DECREF atomic would require CPU memory barriers on every reference count operation - expensive because memory barriers force cache coherency across all cores.

The observation: most objects are used only by the thread that created them. If we split the reference count into a thread-local component (the "local count", non-atomic) and a shared component (the "shared count", atomic), we can avoid atomics in the common case.

When the owning thread increments a refcount: non-atomic ob_ref_local++. When any other thread increments a refcount: atomic ob_refcnt += 1. When the owning thread decrements: check --ob_ref_local == 0 then check shared count; if both are zero, deallocate. This means: objects that stay on one thread (the vast majority) pay zero atomic overhead. Only objects that are genuinely shared across threads pay the atomic cost.

The implementation also includes logic for "merging" the two count fields and for handling the edge case where the owning thread exits before all shared references are dropped.

Q4: What is the current state of the Python 3.13 free-threaded build? Should production systems use it?

In Python 3.13, the free-threaded build is officially experimental and is not recommended for most production systems.

The current limitations: (1) Single-threaded performance is 20-40% slower than the GIL build, due to the overhead of biased reference counting and per-object locking even when running on one thread. (2) C extensions that have not been updated to declare Py_MOD_GIL_NOT_USED will re-enable the GIL for the entire process on import, negating the benefit. Most widely-used extension libraries (numpy, pandas, cryptography, etc.) were actively updating their status through 2024-2025, but the ecosystem was not complete as of Python 3.13.0. (3) Some Python patterns that were safe under the GIL may have race conditions in free-threaded mode - auditing is required. (4) Memory usage per object is higher due to the additional ob_ref_local field and embedded per-object mutex.

For experimental use: it works for pure-Python CPU-bound workloads. Parallel speedup approaches linear for 4-8 threads on compute-heavy code. For production: prefer Python 3.14 or later, where free-threading is an officially supported build (PEP 779), the single-threaded overhead is reported to be substantially lower, and extension ecosystem compatibility is broader.

Q5: What changed in Python 3.11's frame system, and what are the implications for sys._getframe() and traceback cost?

Python 3.11 replaced PyFrameObject with _PyInterpreterFrame as the internal frame type used during function execution. _PyInterpreterFrame is a thin struct designed for performance: it lives on a thread state "datastack" (a preallocated contiguous buffer, not the heap), has no Python object header, and does not allocate a locals dict until required.

The implications:

  1. sys._getframe() now creates a frame object on demand - it materialises a PyFrameObject lazily from the internal _PyInterpreterFrame. This makes sys._getframe() slightly more expensive in 3.11+ than in 3.10, but normal function execution is much faster because no frame object is created.
  2. Traceback cost - exceptions that include tracebacks cause frame materialisation for each frame in the call stack. This is still fast, but the cost is incurred at exception time rather than at call time. Code that raises exceptions in hot paths should be profiled to verify this trade-off.
  3. f_locals dict - writes to frame.f_locals from a debugger or profiler were never guaranteed to reach the actual local variables; in CPython 3.12 and earlier they sometimes did, as an implementation detail. Python 3.13 (PEP 667) replaces this with a well-defined write-through proxy. Code using ctypes tricks to modify locals in other frames must be updated.
  4. Memory profilers and monitoring tools that held references to PyFrameObject objects to track call sites may see higher overhead in 3.11+ due to lazy materialisation.

© 2026 EngineersOfAI. All rights reserved.